Today we will learn some useful data manipulation functions in R. A lot of this stuff can be done with base R functions, but if you don’t know how to do it in base R, using tidyverse is probably easier to learn! I will be around to help, but please also help each other where possible.

Resources

There are some useful cheat sheets available at https://rstudio.com/resources/cheatsheets/:
- Data Import Cheat Sheet - Data Transformation Cheat Sheet - Factors with forcats Cheat Sheet

And this book on R for data science is useful too: http://r4ds.had.co.nz/

Outline

Here is a list of things we will learn today:

  1. Filtering observations using filter()
  2. Selecting variables using select()
  3. Changing data from wide to long format and vice versa, using pivot_longer() and pivot_wider()
  4. Merging different dataframes together with the bind() and join() functions
  5. Using the pipe operator %>% to write tidy code and group observations with group_by()
  6. Calculating new variables with mutate() and summarise()
  7. Reordering factor levels for nicer plots

How we will work today

There are lots of functions in different packages, and altogether they are called tidyverse. The tidyverse packages we will use today are dplyr, tidyr, and forcats, but you can download all of them in one go using install.packages("tidyverse"). We will not use every single function, or different way of using them, but instead focus on the main ones that will hopefully come in handy!

I have not included the output of the code here, so you will have to run it yourself :)

There will be some exercises, formatted like this, so you can get some practice in.

library(tidyverse)

We will load in some data first. As it was International Women’s Day on Sunday, we will use data from the Women’s Football World Cups. For a description of the variables see https://github.com/rfordatascience/tidytuesday/tree/master/data/2019/2019-07-09.

This is how to read in the data from github:

wwc_outcomes <- read.csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-07-09/wwc_outcomes.csv")
squads <- read.csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-07-09/squads.csv")
codes <- read.csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-07-09/codes.csv")

I usually look at the data to start with, mainly using str(), summary() or head(). To understand what every function does, it might be helpful to run these before and after you have made changes to the data!

We will start using the dataframe wwc_outcomes.

str(wwc_outcomes)
summary(wwc_outcomes)
head(wwc_outcomes)

Use the above functions to look at the dataframes squads and codes too.

The functions follow the normal form in R, for example, we can change column names using rename():

wwc_outcomes2 <- rename(wwc_outcomes, gameID = yearly_game_id)
head(wwc_outcomes2)

This can be useful when we read in data with long, complicated column names, for example.

What do the functions distinct() and arrange() do? Have a look on the data transformation cheat sheet.

1. filter()

One of the simplest and most useful functions is filter().

filter() will retain or remove rows in your dataframe, according to conditions you set:

ggplot(data = squads, aes(x = player, y = goals)) +
  geom_col() +
  coord_flip()  # Makes a bar plot with ggplot, but we can't see anything in there as there are so many players!

goals30 <- filter(squads, goals > 30) 
# this creates a new dataframe called goals30, which are all rows from dataframe squads where variable goals is above 30

ggplot(data = goals30, aes(x = player, y = goals)) +
  geom_col() +
  coord_flip()

Exercise: Create a dataframe from squads where all caps are above 150, and check it worked using ggplot()

We can also look for exact values:

filter(wwc_outcomes, score == 13)
## In this case, we will see the output straight away, rather than saving it in a separate dataframe. This is useful if we want to look into our data a bit

The structure for NA values is slightly different (x == NA doesn’t work):

summary(squads$goals)
# goals contains 32 rows with NA values

goalsNA <- filter(squads, is.na(goals))
# In this case we keep those rows where goals is NA

summary(goalsNA$goals)

Can you create a dataframe where the variable goals is never NA? You cna use the Data Transformation cheat sheet.

In the wwc_outcomes dataframe, retain only observations where team is ENG (or another team of your choosing).

Can you also create a new dataframe with three different teams? You might need to use this operator: %in%

2. select()

The select() function can be used to include or exclude columns from the dataframe:

str(wwc_outcomes)

wwc_outcomes2 <- select(wwc_outcomes, -team_num) 
# this will make a new dataframe, without the variable team_num

str(wwc_outcomes2)

wwc_outcomes3 <- select(wwc_outcomes2, year, team) 
str(wwc_outcomes3)
# this will make a new dataframe, with only the variables team and win_status. Check the number of observations
wwc_outcomes3 <- distinct(wwc_outcomes3) ## this step will remove all duplicate values, so we can see which team played in which year

Using wwc_outcomes, create a new dataframe where all variables start with t (use the cheatsheet).

There are lots of other ways of using filter() and select(), all listed on the cheat sheets.

3. Reshaping data

We can use tidyverse for reshaping data. As an example, we will extract the teams who made it into the finals each year, and create columns that specficy who won and who lost:

finals <- filter(wwc_outcomes, round == "Final")
finals <- select(finals, year, team, win_status)
finals_wide <- pivot_wider(finals, names_from = win_status, values_from = team)
finals_wide

## Let's have a look who won most during the finals:
ggplot(finals_wide, aes(x = Won)) +
  geom_bar()

Can you reverse what you did, and put the data back into the original format? Call the dataframe finals_long and use the function pivot_longer().

4. Merging dataframes

tidyverse has the function bind_rows() which is similar to rbind. It does as it says - binding rows of dataframes together. The key thing is that the column headings should be the same in both dataframes.

It is also possible to merge two dataframes based on one or more columns.

In the wwc_outcomes dataframe, we have the teams, but we might not know what all the short versions mean. However, they are stored in the codes dataframe:

head(codes)
head(wwc_outcomes)

## We might not know what all the team codes stand for:
ggplot(wwc_outcomes, aes(x = team)) +
  geom_bar() +
  coord_flip()

## but we have them stored in the codes dataframe that we read in at the start:
wwc_outcomes_c <- left_join(wwc_outcomes, codes, by = "team")

## Now we can see which teams played the most games:
ggplot(wwc_outcomes_c, aes(x = country)) +
  geom_bar() +
  coord_flip()

The function left_join() retains the dataframe that we specify first, and adds in values from the dataframe we specify next. There are different join() functions and they do differnt things. When I use these functions I check and double check what exactly each function does, as it can get a bit confusing:

all_countries <- right_join(wwc_outcomes, codes, by = "team")
# right_join will keep all the data from codes, and add the data from wwc_outcomes in front of it

What does anti_join() do? Try it with wwc_outcomes and codes.

5. Piping %>% and group_by()

If we need to do a lot of data manipulation in a dataframe, we can use piping, with the piping operator %>%. Have a look at this shockingly untidy example from earlier:

finals <- filter(wwc_outcomes, round == "Final")
finals <- select(finals, year, team, win_status)
finals <- pivot_wider(finals, names_from = win_status, values_from = team)

If we use piping, we can reduce the code and have less opportunities for misspellings etc. I have written the code below with pipes, but it does exactly the same as the code above:

wwc_outcomes %>% 
  filter(round == "Final") %>% 
  select(year, team, win_status) %>% 
  pivot_wider(names_from = win_status, values_from = team) ->
  finals

# rather than having to write wwc_outcomes at the start of each bracket, we don't need to write it apart from the first part of the code
# to save what we are doing, we need to end the code with ->, followed by the name of the new dataframe on the next line

Compare the code with and witout pipe closely, so you understand how to structure the pipe.

Have a look at this code:

wwc_outcomes2 <- filter(wwc_outcomes, round == "Group")
wwc_outcomes3 <- select(wwc_outcomes2, team, score, win_status)
wwc_outcomes4 <- sample_n(wwc_outcomes3, 20)

Can you turn the above horrible code into a nice pipe? Hint: the keyboard shortcut for the pipe operator is Ctrl + Shift + M

6. mutate() and summarise()

We can use mutate() to calculate new variables, based on others we already have:

squads <- mutate(squads, gpc = goals / caps)
# the new variable is called gpc. Caps were the number of international games played

squads %>% 
  filter(gpc > 0.5) %>% 
  ggplot(aes(x = player, y = gpc)) +
  geom_col() +
  coord_flip()

mutate() can also change variables according to conditions we set:

str(squads)
squads <- mutate_if(squads, is.factor, as.character)
str(squads)

We can also group observations with piping, and calculate summary statistics:

squads %>%
  group_by(country) %>%
  summarise(all_goals = sum(goals)) %>% 
  filter(all_goals > 200)

Can you calculate the number of games each team played, in wwc_outcomes?

7. Reordering factors

R orders factors alphabetically. Often we do not want to display bar plots in this way, but by a frequency or another value. For example, we made the following plot earlier:

squads %>% 
  filter(gpc > 0.5) %>% 
  ggplot(aes(x = player, y = gpc)) +
  geom_col() +
  coord_flip()

## in reverse alphabetical order, because we used coord_flip()

We want to order the plot by gpc really. This is where fct_reorder() comes in, which we can use within the ggplot() function:

squads %>% 
  filter(gpc > 0.5) %>% 
  ggplot(aes(x = fct_reorder(player, gpc), y = gpc)) +
  geom_col() +
  coord_flip()

Can you reverse the order, so the highest gpc is at the bottom? Use the Factors with forcats Cheat Sheet.

Sometimes we might want to order by the frequency, particularly for barplots. Another plot we made earlier:

ggplot(finals_wide, aes(x = Won)) +
  geom_bar()

We can use the fct_infreq() function to reorder:

ggplot(finals_wide, aes(x = fct_infreq(Won))) +
  geom_bar()

8. A bit more on factors

Sometimes we want to change the names of a factor, because they are too long, or are incorrect. We can check and change individual factor levels like this:

levels(wwc_outcomes$win_status)
levels(wwc_outcomes$win_status) <- c("Lost", "Tie", "Winners YAAAY")
levels(wwc_outcomes$win_status)[levels(wwc_outcomes$win_status) == "Lost"] <- "Losers :(" 
## this just changes one factor level

If we want to display factor levels over 2 lines because they are so long, we can change them in this way using \n:

ggplot(data = goals30, aes(x = fct_reorder(player, goals), y = goals)) +
  geom_col() +
  coord_flip() +
  labs(x = "Player", y = "Goals")

levels(goals30$player)[levels(goals30$player) == "Madeleine Ngono Mani"]<-"Madeleine\nNgono Mani"

ggplot(data = goals30, aes(x = fct_reorder(player, goals), y = goals)) +
  geom_col() +
  coord_flip()

And that’s it! You are now an expert at data wrangling, well done!